Topological Solitons and Folded Proteins

We propose that protein loops can be interpreted as topological domain-wall solitons. They
interpolate between ground states that are the secondary structures like α-helices and β-strands.
Entire proteins can then be folded simply by assembling the solitons together, one after another. We
present a simple theoretical model that realizes our proposal and apply it to a number of biologically
active proteins including 1VII, 2RB8, 3EBX (Protein Data Bank codes). In all the examples that
we have considered we are able to construct solitons that reproduce secondary structural motifs
such as α-helix-loop-α-helix and β-sheet-loop-β-sheet with an overall root-mean-square-distance
accuracy of around 0.7 Ångström or less for the central α-carbons, i.e. within the limits of current
experimental accuracy.

PACS numbers: 87.15.A-,87.15.Cc,87.14.hm 



Solitons are ubiquitous and widely studied objects that arise in a variety of practical and
theoretical scenarios [1], [2]. For example, solitons can be deployed for data transmission in
transoceanic cables and for conducting electricity in organic polymers [1], and they may also
transport chemical energy in proteins [3]. Solitons explain the Meissner effect in
superconductivity and dislocations in liquid crystals [1]. They also model hadronic particles,
cosmic strings and magnetic monopoles in high energy physics [1], and so on. The first soliton
to be identified is the Wave of Translation that was observed by John Scott Russell in the Union
Canal of Scotland. This wave can be accurately described by an exact soliton solution of the
Korteweg-de Vries (KdV) equation [1]. At least in principle it can also be constructed in an
atomic-level simulation where one accounts for each and every water molecule in the Canal,
together with all of their mutual interactions. However, in such a Gedanken simulation it would
probably become a real challenge to unravel the collective excitations that combine into the
Wave of Translation without any guidance from the known soliton solution of the KdV equation,
since solitons cannot be constructed simply by adding up small perturbations around some ground
state: a (topological) soliton emerges when non-linear interactions combine elementary
constituents into a localized collective excitation that is stable against small perturbations
and cannot decay, unwrap or disentangle [1], [2].

In this Letter we propose that (topological) solitons can also explain and describe the folding
of proteins into their native state 0], 0. We characterize a folded protein by the Cartesian
coordinates $\mathbf{r}_i$ of its $N$ central α-carbons, with $i = 1, \dots, N$. For many
biologically active proteins these coordinates can be downloaded from the Protein Data Bank
(PDB) [7]. Alternatively, the protein can be described in terms of its bond and torsion angles,
which can be computed from the PDB data. For this we introduce the tangent vector
$\mathbf{t}_i$ and the binormal vector $\mathbf{b}_i$,

\[ \mathbf{t}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{|\mathbf{r}_{i+1} - \mathbf{r}_i|}, \qquad \mathbf{b}_i = \frac{\mathbf{t}_{i-1} \times \mathbf{t}_i}{|\mathbf{t}_{i-1} \times \mathbf{t}_i|}. \tag{1} \]

Together with the normal vector $\mathbf{n}_i = \mathbf{b}_i \times \mathbf{t}_i$ we then
have three vectors that are subject to the discrete Frenet equation [8]

\[ \begin{pmatrix} \mathbf{n}_{i+1} \\ \mathbf{b}_{i+1} \\ \mathbf{t}_{i+1} \end{pmatrix} = \exp\{-\kappa_i T^2\} \cdot \exp\{-\tau_i T^3\} \begin{pmatrix} \mathbf{n}_i \\ \mathbf{b}_i \\ \mathbf{t}_i \end{pmatrix}. \tag{2} \]

Here $T^2$ and $T^3$ are two of the standard generators of three-dimensional rotations;
explicitly, in terms of the permutation tensor, $(T^i)^{jk} = \epsilon^{ijk}$. From (1), (2)
we can compute the bond angles $\kappa_i$ and the torsion angles $\tau_i$ using PDB data for
$\mathbf{r}_i$. Alternatively, if we know these angles we can compute the coordinates
$\mathbf{r}_i$. The common convention is to select the range of these angles so that
$\kappa_i$ is positive. In the continuum limit, where (2) becomes the standard Frenet equation
for a continuous curve, $\kappa_i \to \kappa(x)$ corresponds to the local curvature.
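
As a concrete illustration of how bond and torsion angles follow from the α-carbon coordinates, the following Python sketch computes them for a short chain. It is a minimal sketch under stated assumptions, not the authors' code: the function name `frenet_angles`, the index conventions and the sign convention for the torsion angle are choices made here, the helix used as input is a hypothetical test chain rather than PDB data, and the input is assumed to contain no three consecutive collinear points.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    norm = math.sqrt(dot(a, a))
    return (a[0] / norm, a[1] / norm, a[2] / norm)

def frenet_angles(coords):
    """Return (kappa, tau) for a chain of 3D points.

    kappa[j] is the angle between consecutive tangents (always in [0, pi]);
    tau[j] is the signed angle between consecutive binormals.
    """
    # unit tangents along each bond
    t = [unit(sub(coords[i + 1], coords[i])) for i in range(len(coords) - 1)]
    # unit binormals from pairs of consecutive tangents
    b = [unit(cross(t[i - 1], t[i])) for i in range(1, len(t))]
    kappa = [math.acos(max(-1.0, min(1.0, dot(t[i - 1], t[i]))))
             for i in range(1, len(t))]
    tau = []
    for j in range(1, len(b)):
        # b[j-1] and b[j] are both orthogonal to the shared tangent t[j]
        cos_tau = dot(b[j - 1], b[j])
        sin_tau = dot(cross(b[j - 1], b[j]), t[j])
        tau.append(math.atan2(sin_tau, cos_tau))
    return kappa, tau

# Hypothetical test chain: an ideal helix, for which both angles are constant.
helix = [(math.cos(0.9 * i), math.sin(0.9 * i), 0.4 * i) for i in range(8)]
kappa, tau = frenet_angles(helix)
print([round(k, 4) for k in kappa])
print([round(x, 4) for x in tau])
```

A chain of n points yields n - 2 bond angles and n - 3 torsion angles in this convention, so the first and last residues of a protein carry no complete set of angles.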

As an example we consider the 35-residue villin headpiece protein with PDB code 1VII, which has
been widely investigated both theoretically and experimentally g]. For example, the
state-of-the-art simulation [5] succeeded in producing its fold for a short time within an
accuracy of ~2-3 Å.

From the PDB data we compute the values of the bond angles $\kappa_i$ and torsion angles
$\tau_i$. The result is displayed in Figure 1(a), where we use the (standard) convention that
the discrete Frenet curvature is positive. In 1VII there are three α-helices that are separated
by two loops. When we use the PDB (NMR) convention for indexing the residues, the first, longer
loop is located at sites 49-54 and the second, shorter loop at sites 59-62.

FIG. 1: (a) The bond and torsion angles of 1VII, computed with the (standard) convention that
the discrete Frenet curvature $\kappa_i$ is positive. (b) The $Z_2$ gauge transformed bond and
torsion angles.

We shall now show that Figure 1(a) describes two soliton configurations, albeit in an encrypted
form. In order to decrypt the data in Figure 1(a) so that these solitons are unveiled, we
observe that equation (2) has the following local $Z_2$ gauge symmetry: at every site we can
send

\[ \kappa_i \to \kappa_i \cdot \cos(\Delta_{i+1}), \qquad \tau_i \to \tau_i + \Delta_i - \Delta_{i+1}, \tag{3} \]

and when we choose at each site $\Delta_i = 0$ or $\Delta_i = \pi$, where $\Delta_i = \pi$ is
the nontrivial element of the $Z_2$ gauge group, the Cartesian coordinates $\mathbf{r}_i$
computed from the discrete Frenet equation remain intact. If we judiciously implement this
$Z_2$ gauge transformation in the data displayed in Figure 1(a) we arrive at the apparently
quite different Figure 1(b). Unlike in Figure 1(a), the profile of $\kappa_i$ in Figure 1(b)
clearly displays the hallmark profile of a topological soliton-(anti)soliton pair in a
double-well potential: the two solitons are located around the sites with indices 49-54 and
59-62, which are the locations of the two loops in 1VII. These solitons interpolate between the
two "ground state" values $\kappa_i \approx \pm\pi/2$ that pinpoint the locations of the
α-helices in 1VII. Moreover, the two downswings in the value of $\tau_i$ from the value
$\tau_i \approx 1$ that marks the locations of the α-helices coincide with the locations of the
two solitons. The ensuing combined profile of $\kappa_i$ and $\tau_i$ is qualitatively
consistent with a double-well potential structure in the $(\kappa, \tau)$ plane that has the
form displayed in Figure 2: when we move from left to right in Figure 1(b), we follow a
trajectory in the $(\kappa, \tau)$ plane that starts by fluctuating around the potential energy
minimum at $(\kappa, \tau) \approx (-\pi/2, 1)$ in Figure 2, corresponding to the first α-helix.
The trajectory then moves through the first loop, a.k.a. soliton (the red dashed line), to the
second potential energy minimum, i.e. α-helix, at $(\kappa, \tau) \approx (+\pi/2, 1)$ in
Figure 2, and finally back through the second loop, a.k.a. soliton (the blue solid line), to the
first potential energy minimum at $(\kappa, \tau) \approx (-\pi/2, 1)$.
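
The statement that the $Z_2$ gauge transformation leaves the coordinates intact can be checked numerically. The Python sketch below is illustrative only and rests on assumptions made here: the transformation is taken to be $\kappa_i \to \kappa_i \cos(\Delta_{i+1})$, $\tau_i \to \tau_i + \Delta_i - \Delta_{i+1}$; the frame update in `step` (a rotation about the tangent by the torsion angle followed by a rotation about the new binormal by the bond angle) is one possible realization of the discrete Frenet equation; the bond length of 3.8 Å is the typical distance between neighbouring α-carbons and is not taken from the text; and the angle values are made up. The first gauge angle is set to zero because the initial frame is held fixed.

```python
import math

def step(frame, kappa, tau):
    """Advance an orthonormal frame (n, b, t) by one site of the chain."""
    n, b, t = frame
    ct, st = math.cos(tau), math.sin(tau)
    ck, sk = math.cos(kappa), math.sin(kappa)
    # rotate n and b about t by the torsion angle
    n1 = tuple(ct * n[j] + st * b[j] for j in range(3))
    b1 = tuple(-st * n[j] + ct * b[j] for j in range(3))
    # rotate n1 and t about b1 by the bond angle
    t2 = tuple(ck * t[j] + sk * n1[j] for j in range(3))
    n2 = tuple(ck * n1[j] - sk * t[j] for j in range(3))
    return (n2, b1, t2)

def rebuild(kappa, tau, bond=3.8):
    """Integrate the frames and return the chain coordinates."""
    frame = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    r = (0.0, 0.0, 0.0)
    coords = [r]
    for k, tq in zip(kappa, tau):
        frame = step(frame, k, tq)
        r = tuple(r[j] + bond * frame[2][j] for j in range(3))
        coords.append(r)
    return coords

def z2_gauge(kappa, tau, flips):
    """Apply the Z2 transformation; flips[i] = 1 selects Delta_i = pi.

    flips needs len(kappa) + 1 entries, and flips[0] should be 0 so that
    the fixed initial frame used by rebuild() is left untouched.
    """
    delta = [math.pi * f for f in flips]
    new_kappa = [k * math.cos(delta[i + 1]) for i, k in enumerate(kappa)]
    new_tau = [tq + delta[i] - delta[i + 1] for i, tq in enumerate(tau)]
    return new_kappa, new_tau

# Made-up angles for a six-point chain.
kappa = [1.5, 1.4, 0.9, 1.2, 1.5]
tau = [1.0, 0.8, -2.0, 0.5, 1.0]
flips = [0, 0, 1, 1, 0, 1]
gauged_kappa, gauged_tau = z2_gauge(kappa, tau, flips)
before = rebuild(kappa, tau)
after = rebuild(gauged_kappa, gauged_tau)
# largest coordinate difference between the two chains (round-off only)
print(max(abs(p[j] - q[j]) for p, q in zip(before, after) for j in range(3)))
```

The transformed bond angles change sign wherever a flip is applied, which is how a profile with strictly positive $\kappa_i$ can be rewritten as one that crosses zero inside a loop.
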
FIG. 2: The potential energy on the $(\kappa, \tau)$ plane that corresponds qualitatively to
the data in Figure 1(b). The soliton between sites 49-54 corresponds to the red dashed
trajectory and the soliton between sites 59-62 to the blue solid trajectory.

We now present a simple theoretical model [9], [10] that reproduces the $(\kappa, \tau)$
profile in Figure 1(b) as a combination of two soliton solutions, with a very high atomic-level
accuracy for the central α-carbons. The model is defined by the energy functional

\[ E = \sum_{i=1}^{N-1} (\kappa_{i+1} - \kappa_i)^2 + \sum_{i=1}^{N} c \cdot (\kappa_i^2 - m^2)^2 + \sum_{i=1}^{N} \left\{ b\, \kappa_i^2 \tau_i^2 + d\, \tau_i + e\, \tau_i^2 + q\, \kappa_i^2 \tau_i \right\}. \tag{4} \]

Here $N$ is the number of central α-carbons and $(c, m, b, d, e, q)$ are parameters. The first
sum describes nearest neighbor interactions along the protein. The second sum describes a local
self-interaction of the bond angles. The third sum describes local interactions between bond
and torsion angles: its first term has its origin in a Higgs effect which is due to the
potential term in the second sum. The second term in the third sum is the Chern-Simons term; it
is responsible for the chirality of the protein chain. The third term is a Proca mass term, and
the last term can also be related to the Abelian Higgs Model; it is also chiral. As explained in
[10], this energy functional is essentially unique, and in particular it can be related to a
gauge invariant (supercurrent) version of the energy of the 1+1 dimensional lattice Abelian
Higgs Model. In three space dimensions this model is also known as the Ginzburg-Landau Model of
conventional superconductivity [2]. Note that in (4) there is no reference to the specifics of
the interactions involving amino acids, such as the hydrophobic, hydrophilic, long-range
Coulomb, van der Waals, saturating hydrogen bond, etc. interactions that are presumed to drive
the folding process. The only explicit long-range force present in (4) is the nearest neighbor
interaction described by the first term. Moreover, as it stands (4) depends only on six
site-independent, homogeneous parameters. There is no direct reference whatsoever to the
underlying, in general highly inhomogeneous, amino acid structure of a protein.
We argue that this becomes possible since (4) supports solitons that describe the common
secondary structural motifs such as α-helix/β-strand - loop - α-helix/β-strand as solutions to
its classical equations of motion. Furthermore, even though the actual numerical values of the
parameters are certainly motif dependent, and for long loops that constitute bound states of
several solitons one might need to introduce more than six parameters, we expect there to be
wide universality so that a given soliton with its relatively few parameters describes a general
class of homologous motifs. Consequently only a relatively small set of parameters is needed to
provide soliton templates for structure prediction. In fact, we propose that solitons are the
mathematical manifestation of the experimental observation that the number of different protein
folds is surprisingly limited. The presence of solitons could then be the reason for the success
of bioinformatics-based homology modeling in predicting native folds [4].

In order to quantitatively disclose the soliton solution of (4), we start by observing that the
first two sums in (4) can be interpreted as a discrete version of the energy of the 1+1
dimensional double-well model that is known to support the topological kink-soliton. In the
continuum limit the kink has the analytic form [1], [2],

\[ \kappa(x) = m \cdot \tanh[m\sqrt{c} \cdot (x - x_0)] . \]
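
To make the structure of the model concrete, the Python sketch below evaluates an energy of the kind described in the text together with the continuum kink. It is a sketch under assumptions, not the authors' implementation: the functional form (unit-weight squared differences of neighbouring bond angles, a double well $c(\kappa^2 - m^2)^2$, and the four bond-torsion terms $b\kappa^2\tau^2$, $d\tau$, $e\tau^2$, $q\kappa^2\tau$) is inferred from the verbal description and from the kink profile above, the helper `tau_stationary` simply solves $\partial E / \partial \tau_i = 0$ for that form, and the parameter values are hypothetical.

```python
import math

def energy(kappa, tau, c, m, b, d, e, q):
    """Nearest-neighbour term + double well + bond-torsion coupling terms."""
    n = len(kappa)
    hop = sum((kappa[i + 1] - kappa[i]) ** 2 for i in range(n - 1))
    well = sum(c * (k * k - m * m) ** 2 for k in kappa)
    mix = sum(b * k * k * t * t + d * t + e * t * t + q * k * k * t
              for k, t in zip(kappa, tau))
    return hop + well + mix

def tau_stationary(k, b, d, e, q):
    """Torsion angle at which dE/dtau vanishes for fixed kappa.

    Requires e + b*k**2 != 0; it is a minimum when e + b*k**2 > 0.
    """
    return -(d + q * k * k) / (2.0 * (e + b * k * k))

def kink(x, m, c, x0):
    """Continuum kink of the double-well model."""
    return m * math.tanh(m * math.sqrt(c) * (x - x0))

# Hypothetical parameter values, chosen only for illustration.
c, m, b, d, e, q = 0.5, 1.5, 1.0, -1.0, 0.5, 0.2
kappa = [kink(i, m, c, 10.5) for i in range(21)]
tau = [tau_stationary(k, b, d, e, q) for k in kappa]
print(energy(kappa, tau, c, m, b, d, e, q))
```

Because the torsion angles enter this form at most quadratically and without coupling between sites, they can be eliminated site by site, which leaves an effective model for the bond angles alone.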

We can try to estimate the parameters $m$ and $c$ for each of the two solitons in Figure 1(b)
by a least-squares fit in which this continuum soliton approximates the exact soliton solution
of the discrete equations of motion. We consider here explicitly only the first soliton of
1VII, located between (PDB index) sites 49-54. Using the sites 46-56 we find the following
least-squares fit:

\[ \kappa(x) \approx 1.4627 \cdot \tanh[2.0816\,(x - 52.597)] . \tag{5} \]

In order to construct $\tau(x)$ we solve its equation of motion in (4). The result is

\[ \tau(x) = -2.4068 \cdot \frac{1 - 0.4689 \cdot \kappa^2(x)}{1 - 0.4619 \cdot \kappa^2(x)} . \tag{6} \]

In Figure 3 we show how the data in Figure 1(b) are described by the approximate soliton
profile (5), (6). When we construct the ensuing discrete curve in three-dimensional space by
solving (2) with $\kappa_i$ and $\tau_i$ given by (5) and (6), we reproduce the first loop of
1VII with an RMSD accuracy of ~1.43 Å for the PDB indices 46-56, which is quite remarkable
taking into account the simplicity of our approximation.
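
The fitted profiles can be tabulated directly. The Python sketch below evaluates the fit (5) and the torsion profile (6) at the integer sites 46-56 and recovers the model parameters from the fit, using the kink relation that the fitted amplitude equals $m$ and the fitted slope equals $m\sqrt{c}$. The function names are chosen here for illustration, and the numerical coefficients are copied from (5) and (6) as printed.

```python
import math

# Coefficients of the least-squares fit (5)
M_FIT, SLOPE, X0 = 1.4627, 2.0816, 52.597

def kappa_fit(x):
    """Bond-angle profile of Eq. (5)."""
    return M_FIT * math.tanh(SLOPE * (x - X0))

def tau_fit(x):
    """Torsion-angle profile of Eq. (6), evaluated on the fitted kappa."""
    k2 = kappa_fit(x) ** 2
    return -2.4068 * (1 - 0.4689 * k2) / (1 - 0.4619 * k2)

# Kink parameters implied by the fit: amplitude = m, slope = m * sqrt(c)
m = M_FIT
c = (SLOPE / M_FIT) ** 2

print("m =", m, " c =", round(c, 4))
for site in range(46, 57):
    print(site, round(kappa_fit(site), 4), round(tau_fit(site), 4))
```

The denominator in (6) stays positive over the whole fitted range because $\kappa^2$ never exceeds $1.4627^2 \approx 2.14$, which is below $1/0.4619 \approx 2.17$.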

FIG. 3: The PDB data for the first α-helix - loop - α-helix motif in 1VII, $\kappa_i$ on the
left and $\tau_i$ on the right, together with the least-squares approximations (5) and (6) (the
blue solid lines).

In order to construct a more accurate description of 1VII, we resort to a numerical
construction of a soliton solution to the equations of motion of (4). We use simulated
annealing that involves a Monte Carlo energy minimization of the energy functional (7), with a
simultaneous cooling of the two (inverse) temperatures $\beta_1$ and $\beta_2$. Here the first
sum vanishes when we have a solution to the classical difference equation of motion of (4), and
the second sum computes the RMSD distance between the $i$th α-carbon of the solution and the
protein we wish to construct. The second term in (7) acts like a chemical potential that selects
the parameters in (4) so that we arrive at a soliton solution that corresponds to the given
protein.
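
The annealing procedure described above can be sketched in Python as follows. This is a simplified illustration under several assumptions and should not be read as the authors' implementation: the annealed quantity is taken to be $\beta_1 S_1 + \beta_2 S_2$, where $S_1$ is the sum of squared derivatives of an energy of the assumed form (nearest-neighbour differences, double well, and the four bond-torsion terms with parameters `c, m, b, d, e, q`) and $S_2$ is the sum of squared distances to the target coordinates without any superposition step; only the angles are varied while the parameters are held fixed; the frame convention, the 3.8 Å bond length, the geometric cooling schedule and all numerical values are hypothetical.

```python
import math
import random

def grad_residual(kappa, tau, params):
    """S1: sum of squared dE/dkappa_i and dE/dtau_i (zero at a solution)."""
    c, m, b, d, e, q = params
    n, total = len(kappa), 0.0
    for i in range(n):
        k, t = kappa[i], tau[i]
        dk = 4 * c * k * (k * k - m * m) + 2 * b * k * t * t + 2 * q * k * t
        if i > 0:
            dk += 2 * (k - kappa[i - 1])
        if i < n - 1:
            dk -= 2 * (kappa[i + 1] - k)
        dt = 2 * b * k * k * t + d + 2 * e * t + q * k * k
        total += dk * dk + dt * dt
    return total

def rebuild(kappa, tau, bond=3.8):
    """Chain coordinates from the angles (assumed frame convention)."""
    n, b, t = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
    r = (0.0, 0.0, 0.0)
    coords = [r]
    for k, tq in zip(kappa, tau):
        ct, st, ck, sk = math.cos(tq), math.sin(tq), math.cos(k), math.sin(k)
        n1 = tuple(ct * n[j] + st * b[j] for j in range(3))
        b = tuple(-st * n[j] + ct * b[j] for j in range(3))
        t, n = (tuple(ck * t[j] + sk * n1[j] for j in range(3)),
                tuple(ck * n1[j] - sk * t[j] for j in range(3)))
        r = tuple(r[j] + bond * t[j] for j in range(3))
        coords.append(r)
    return coords

def sq_dist(a, b):
    """S2: sum of squared distances between matching points."""
    return sum((u[j] - v[j]) ** 2 for u, v in zip(a, b) for j in range(3))

def anneal(target, kappa, tau, params, steps=2000,
           beta1=1.0, beta2=1.0, growth=1.002, seed=0):
    """Metropolis updates of single angles while both betas grow."""
    rng = random.Random(seed)
    kappa, tau = list(kappa), list(tau)
    s1 = grad_residual(kappa, tau, params)
    s2 = sq_dist(rebuild(kappa, tau), target)
    for _ in range(steps):
        seq = kappa if rng.random() < 0.5 else tau
        i = rng.randrange(len(seq))
        old = seq[i]
        seq[i] = old + rng.gauss(0.0, 0.05)
        n1 = grad_residual(kappa, tau, params)
        n2 = sq_dist(rebuild(kappa, tau), target)
        delta = beta1 * (n1 - s1) + beta2 * (n2 - s2)
        if delta <= 0 or rng.random() < math.exp(-delta):
            s1, s2 = n1, n2          # accept the move
        else:
            seq[i] = old             # reject and restore
        beta1 *= growth              # cool both inverse temperatures
        beta2 *= growth
    return kappa, tau, s1, s2

# Hypothetical example: start from a perturbed copy of a made-up target.
params = (0.5, 1.5, 1.0, -1.0, 0.5, 0.2)
target = rebuild([1.5, 1.5, 0.5, -1.5, -1.5], [1.0, 1.0, -1.0, 1.0, 1.0])
start_kappa = [1.4, 1.6, 0.3, -1.4, -1.6]
start_tau = [0.9, 1.1, -0.8, 1.1, 0.9]
_, _, s1, s2 = anneal(target, start_kappa, start_tau, params, steps=500)
print(s1, s2)
```

A fuller treatment would also propose moves in the parameters themselves and would superimpose the trial structure on the target before measuring the distance.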

We have numerically constructed the classical solutions of (4) that describe the secondary
structural motifs in proteins with PDB codes 1VII, 2RB8 and 3EBX. The first one has three
α-helices separated by loops, while the second and third have β-strand-loop-β-strand motifs.
Both cases can be described equally by (4); the only difference is that in the case of
β-strands the two minima of the (classical) potential in (4) are located at
$(\kappa, \tau) \approx (\pm 1, \pi)$. In each of the proteins that we have studied we have
routinely been able to reproduce the secondary structural motifs as classical soliton solutions
to the equations of motion for (4) in terms of only six parameters and with an overall RMSD
accuracy of around 0.7 Å per motif, which is essentially the experimental accuracy in X-ray
crystallography and NMR. In our simulations the first sum in (7) typically decreases by around
ten orders of magnitude, indicating that the final configuration is a solution, essentially
within numerical accuracy. Consequently, at least in these proteins, the secondary structural
motifs can be viewed as solitons of the model (4) within experimental accuracy. Since the motifs
that we have considered are quite generic in PDB data, we have very little doubt that our
results will persist whenever we have loops that connect α-helices and/or β-strands. And as long
as the loops are not very long and do not describe bound states of several solitons, there does
not appear to be any need to introduce more than six parameters. Work is now in progress to
systematically construct and classify the solitons that describe the secondary structural motifs
in a large class of biologically active proteins.

We have also made tentative attempts to use our solitons to reconstruct entire proteins, by
naively joining the solitons that describe the secondary structural motifs at their ends. In
the case of 1VII we have been able to reproduce in this manner the entire protein as a
classical soliton with an overall RMSD accuracy of around 1.2 Å, and the result is shown in
Figure 4. Even though the accuracy we obtain is very good, the loss of accuracy from ~0.7 Å to
~1.2 Å when we combine the two solitons suggests that we can still substantially improve the
method of assembling an entire folded protein from its solitons. Work is now in progress to
develop more efficient methods for assembling entire proteins from their solitons.

FIG. 4: The helix-loop-helix-loop-helix structure of the 1VII protein (green) together with its
reconstruction in terms of two solitons (purple). The RMSD between the two configurations is
approximately 1.2 Å.

In conclusion, we have proposed that the common secondary structural motifs that describe loops
connecting α-helices and/or β-strands can be interpreted as topological solitons, with the
α-helices and β-sheets viewed as ground states that are interpolated by the loops as solitons.
Entire proteins can then be assembled simply by combining these solitons together one after
another. We have also presented a model that allows us to fold proteins in terms of their
solitons within experimental accuracy. In the simplest form that we have considered here, the
model has only six site-independent but in general motif-dependent parameters. This appears to
be sufficient to describe loops that are not too long. The observation that all the details and
complexities of amino acids and their interactions can be summarized in such simple terms
suggests the existence of wide universality in protein folding, and it can be viewed as a
mathematically precise formulation of the experimental observation that the number of protein
conformations is far more limited than the number of different amino acid combinations.
Finally, we leave it as a future challenge to expand the model so that it incorporates an order
parameter that describes the local orientation of the amino acids along the α-carbon backbone.